Modeling¶
In [1]:
import os
import sys
import warnings
import joblib
import mlflow
import numpy as np
import plotly
import plotly.express as px
import polars as pl
import shap
from optuna.integration.mlflow import MLflowCallback
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve, auc, precision_recall_curve
from sklearn.model_selection import cross_val_score, train_test_split
# Path needs to be added manually to read from another folder
path2add = os.path.normpath(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if path2add not in sys.path:
    sys.path.append(path2add)
from modeling import evaluate_thresholds, tune_hgbt
In [3]:
%load_ext autoreload
%autoreload 2
In [5]:
plotly.offline.init_notebook_mode()
In [7]:
warnings.filterwarnings("ignore")
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("api_anomaly")
mlflow.sklearn.autolog(disable=True)
Data¶
In [10]:
eng_datadf = pl.read_parquet('../data/features_clean_data_supervised.parquet')
eng_datadf.sample(3)
Out[10]:
shape: (3, 22)
| _id |  | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly | std_local_source_degrees | avg_global_source_degrees | max_global_dest_degrees | max_global_source_degrees | std_global_source_degrees | min_global_source_degrees | avg_global_dest_degrees | n_connections | min_global_dest_degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool | f64 | f64 | u32 | u32 | f64 | u32 | f64 | u32 | u32 |
| 398 | "c7865e57-ad1f-3be7-a51c-67a612… | 0.000903 | 0.002501 | 28.735572 | 5938 | "default" | 4553.0 | 3812.0 | 274.0 | "E" | "normal" | false | 19.361173 | 5386.694182 | 22416 | 32071 | 7321.157764 | 9 | 5941.115566 | 1272 | 12 |
| 144 | "678bb2c5-efb4-3701-a992-9c6f64… | 0.028319 | 0.011287 | 16.407407 | 4517 | "default" | 190.0 | 162.0 | 30.0 | "E" | "normal" | false | 3.077491 | 7577.265823 | 22416 | 32071 | 9757.336117 | 215 | 8220.379747 | 79 | 173 |
| 399 | "d3e64ce1-0a5e-3403-a4d6-626a39… | 0.000155 | 0.004298 | 19.311284 | 139 | "default" | 1016.0 | 771.0 | 64.0 | "E" | "normal" | false | 6.979811 | 6550.306977 | 22416 | 32071 | 9576.245138 | 9 | 7012.967442 | 215 | 9 |
In [13]:
label = "is_anomaly"
numerical_features = [
"std_local_source_degrees",
"avg_global_source_degrees",
"max_global_dest_degrees",
"max_global_source_degrees",
"std_global_source_degrees",
"min_global_source_degrees",
"avg_global_dest_degrees",
"n_connections",
"min_global_dest_degrees"]
eng_datadf = eng_datadf.filter(pl.col('ip_type') == 'default').select([label] + numerical_features)
eng_datadf.sample(3)
Out[13]:
shape: (3, 10)
| is_anomaly | std_local_source_degrees | avg_global_source_degrees | max_global_dest_degrees | max_global_source_degrees | std_global_source_degrees | min_global_source_degrees | avg_global_dest_degrees | n_connections | min_global_dest_degrees |
|---|---|---|---|---|---|---|---|---|---|
| bool | f64 | f64 | u32 | u32 | f64 | u32 | f64 | u32 | u32 |
| false | 4.261044 | 10470.013889 | 22013 | 32071 | 10746.215313 | 333 | 7909.236111 | 72 | 118 |
| true | 2.454816 | 8807.18 | 22416 | 32071 | 10666.13462 | 5 | 9944.12 | 50 | 4 |
| false | 14.182674 | 5001.503198 | 22416 | 32071 | 7491.148504 | 11 | 5371.692964 | 938 | 6 |
In [15]:
X_train, X_test, y_train, y_test = train_test_split(eng_datadf[numerical_features], eng_datadf[label].to_list(), test_size=0.2)
In [17]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
Train shape: (1233, 9)
Test shape: (309, 9)
Baseline¶
Can this problem be solved without ML?
In [20]:
from sklearn.metrics import f1_score
heuristic_f1_scores = []
possible_values = X_train['n_connections'].sort().unique().to_list()
for v in possible_values:
heuristic_pred = X_test.select(pl.col('n_connections') <= v).to_pandas()
heuristic_f1_scores.append(f1_score(y_test, heuristic_pred))
In [21]:
px.line(
x=possible_values,
y=heuristic_f1_scores,
labels={
"x": "Number of Connections Threshold",
"y": "F1 Score",
},
title='F1 Score for Heuristic Rule'
)
Insights¶
- The optimal threshold on the number of connections is 71
- The heuristic rule achieves a maximum F1 score of 0.69
Let's compare this with an ML model
In [26]:
with mlflow.start_run(run_name='GBT_baseline'):
params = {
"learning_rate": 0.1,
"max_iter": 10,
"max_leaf_nodes": 10,
"max_depth": None,
"l2_regularization": 0,
"class_weight": 'balanced'
}
mlflow.set_tag("model_name", "HGBT")
mlflow.log_params(params)
gbt = HistGradientBoostingClassifier(**params)
gbt.fit(X_train, y_train)
roc_auc = cross_val_score(gbt, X_train, y_train, cv=5, scoring='roc_auc').mean()
print('ROC AUC (avg 5-fold):', roc_auc)
mlflow.log_metric("roc_auc", roc_auc)
mlflow.sklearn.log_model(gbt, "gbt_models", input_example=X_train.to_numpy())
ROC AUC (avg 5-fold): 0.972685022145271
Observations¶
- The ROC AUC achieved by the initial model is very high (0.97)
Insights¶
- Such a high score could indicate that the task is very simple, or that there is target leakage
- We do not know how the target was created, so we can only assume the dataset creators used simple, reliable anomaly detection methods to label it
- Hyperparameter tuning is not strictly necessary (0.97 is already sufficient for the business use case), but running a few experiments shows how far this model can be pushed
Hyperparameter Tuning¶
In [30]:
mlflc = MLflowCallback(
tracking_uri="sqlite:///mlflow.db",
metric_name="roc auc",
)
best_trial = tune_hgbt(20, mlflc, X_train, y_train)
[I 2024-10-14 19:45:28,741] A new study created in memory with name: hgbt_tuning
[I 2024-10-14 19:45:29,345] Trial 0 finished with value: 0.985240170055626 and parameters: {'max_iter': 59, 'max_leaf_nodes': 11, 'max_depth': 9, 'l2_regularization': 5.211568989875005}. Best is trial 0 with value: 0.985240170055626.
ROC AUC (avg 5-fold): 0.985240170055626
[I 2024-10-14 19:45:29,810] Trial 1 finished with value: 0.9874680546222769 and parameters: {'max_iter': 78, 'max_leaf_nodes': 19, 'max_depth': 4, 'l2_regularization': 1.524314694547202}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9874680546222769
[I 2024-10-14 19:45:30,448] Trial 2 finished with value: 0.9840093322411929 and parameters: {'max_iter': 46, 'max_leaf_nodes': 30, 'max_depth': 8, 'l2_regularization': 6.9116389018452615}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9840093322411929
[I 2024-10-14 19:45:32,601] Trial 3 finished with value: 0.9866128607700897 and parameters: {'max_iter': 98, 'max_leaf_nodes': 25, 'max_depth': 7, 'l2_regularization': 8.083130610131082}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9866128607700897
[I 2024-10-14 19:45:33,104] Trial 4 finished with value: 0.9842784189470001 and parameters: {'max_iter': 50, 'max_leaf_nodes': 30, 'max_depth': 4, 'l2_regularization': 4.956563410882804}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9842784189470001
[I 2024-10-14 19:45:33,844] Trial 5 finished with value: 0.9853035076561547 and parameters: {'max_iter': 88, 'max_leaf_nodes': 20, 'max_depth': 4, 'l2_regularization': 8.14038787213968}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9853035076561547
[I 2024-10-14 19:45:34,213] Trial 6 finished with value: 0.9605465443077088 and parameters: {'max_iter': 52, 'max_leaf_nodes': 23, 'max_depth': 2, 'l2_regularization': 8.471465142876704}. Best is trial 1 with value: 0.9874680546222769.
ROC AUC (avg 5-fold): 0.9605465443077088
[I 2024-10-14 19:45:35,314] Trial 7 finished with value: 0.9880588949218199 and parameters: {'max_iter': 60, 'max_leaf_nodes': 16, 'max_depth': 9, 'l2_regularization': 0.5451546397377538}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9880588949218199
[I 2024-10-14 19:45:36,131] Trial 8 finished with value: 0.9853360920231792 and parameters: {'max_iter': 66, 'max_leaf_nodes': 13, 'max_depth': 7, 'l2_regularization': 8.242736027057564}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9853360920231792
[I 2024-10-14 19:45:36,463] Trial 9 finished with value: 0.9766220355120007 and parameters: {'max_iter': 20, 'max_leaf_nodes': 26, 'max_depth': 6, 'l2_regularization': 7.230381154178385}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9766220355120007
[I 2024-10-14 19:45:36,927] Trial 10 finished with value: 0.985779645969953 and parameters: {'max_iter': 25, 'max_leaf_nodes': 16, 'max_depth': 10, 'l2_regularization': 0.0732553647420362}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.985779645969953
[I 2024-10-14 19:45:37,644] Trial 11 finished with value: 0.9877744043502501 and parameters: {'max_iter': 78, 'max_leaf_nodes': 18, 'max_depth': 5, 'l2_regularization': 0.13390436092695102}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9877744043502501
[I 2024-10-14 19:45:38,398] Trial 12 finished with value: 0.9877405992275451 and parameters: {'max_iter': 73, 'max_leaf_nodes': 16, 'max_depth': 6, 'l2_regularization': 2.094855323880429}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9877405992275451
[I 2024-10-14 19:45:38,909] Trial 13 finished with value: 0.986631477294217 and parameters: {'max_iter': 41, 'max_leaf_nodes': 17, 'max_depth': 10, 'l2_regularization': 0.02385230743064448}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.986631477294217
[I 2024-10-14 19:45:39,574] Trial 14 finished with value: 0.9872480134108731 and parameters: {'max_iter': 84, 'max_leaf_nodes': 13, 'max_depth': 5, 'l2_regularization': 2.5789693329685592}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9872480134108731
[I 2024-10-14 19:45:39,836] Trial 15 finished with value: 0.9535010073959265 and parameters: {'max_iter': 37, 'max_leaf_nodes': 22, 'max_depth': 2, 'l2_regularization': 3.560063808971707}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9535010073959265
[I 2024-10-14 19:45:40,272] Trial 16 finished with value: 0.9869683186651909 and parameters: {'max_iter': 65, 'max_leaf_nodes': 10, 'max_depth': 8, 'l2_regularization': 1.0863229162385684}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9869683186651909
[I 2024-10-14 19:45:41,020] Trial 17 finished with value: 0.9878630780809502 and parameters: {'max_iter': 100, 'max_leaf_nodes': 18, 'max_depth': 5, 'l2_regularization': 3.402628540634783}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9878630780809502
[I 2024-10-14 19:45:41,826] Trial 18 finished with value: 0.9860504466385347 and parameters: {'max_iter': 100, 'max_leaf_nodes': 14, 'max_depth': 8, 'l2_regularization': 9.829015159948952}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.9860504466385347
[I 2024-10-14 19:45:42,021] Trial 19 finished with value: 0.970726919134201 and parameters: {'max_iter': 13, 'max_leaf_nodes': 21, 'max_depth': 3, 'l2_regularization': 3.4691508007009113}. Best is trial 7 with value: 0.9880588949218199.
ROC AUC (avg 5-fold): 0.970726919134201
In [32]:
print(best_trial.params)
{'max_iter': 60, 'max_leaf_nodes': 16, 'max_depth': 9, 'l2_regularization': 0.5451546397377538}
Final Model¶
In [35]:
best_gbt = HistGradientBoostingClassifier(**best_trial.params)
best_gbt.fit(X_train, y_train)
test_preds = best_gbt.predict_proba(X_test)
print('Test ROC AUC:', roc_auc_score(y_test, test_preds[:, 1]))
print('Test PR AUC:', average_precision_score(y_test, test_preds[:, 1]))
Test ROC AUC: 0.99004329004329
Test PR AUC: 0.9813931717932178
In [37]:
fpr, tpr, thresholds = roc_curve(y_test, test_preds[:, 1])
fig = px.area(
x=fpr, y=tpr,
title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
labels=dict(x='False Positive Rate', y='True Positive Rate'),
width=700, height=500
)
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
In [39]:
precision, recall, thresholds = precision_recall_curve(y_test, test_preds[:, 1])
fig = px.area(
x=recall, y=precision,
title=f'Precision-Recall Curve (AUC={auc(recall, precision):.4f})',
labels=dict(x='Recall', y='Precision'),
width=700, height=500
)
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=1, y1=0
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
Observations¶
- The best model's cross-validated ROC AUC (0.988) is about 0.015 higher than the initial model's (0.973); on the held-out test set it reaches 0.990
Threshold Analysis & Evaluation¶
In [43]:
thresholds = np.arange(0, 1.01, 0.01)
rcs, prs, f1s = evaluate_thresholds(thresholds, y_test, test_preds, plot=True)
Threshold with Max F1 Score: 0.59
F1 at threshold 0.59: 0.9263157894736842
Recall at threshold 0.59: 0.8888888888888888
Precision at threshold 0.59: 0.967032967032967
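Like `tune_hgbt`, `evaluate_thresholds` lives in `utils/modeling.py` and is not shown here. Its core logic presumably sweeps the threshold grid and scores the binarized predictions; a hypothetical sketch without the plotting (here taking the positive-class probabilities directly, whereas the actual helper receives the full `predict_proba` output):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate_thresholds_sketch(thresholds, y_true, proba_pos):
    """Score binarized predictions at each candidate threshold."""
    rcs, prs, f1s = [], [], []
    for t in thresholds:
        pred = proba_pos >= t
        rcs.append(recall_score(y_true, pred, zero_division=0))
        prs.append(precision_score(y_true, pred, zero_division=0))
        f1s.append(f1_score(y_true, pred, zero_division=0))
    best = int(np.argmax(f1s))
    print(f"Threshold with Max F1 Score: {thresholds[best]:.2f}")
    return rcs, prs, f1s

# Toy demonstration (illustrative data, not the notebook's test set)
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
rcs, prs, f1s = evaluate_thresholds_sketch(np.arange(0, 1.01, 0.01), y_true, proba)
```

`zero_division=0` silences the undefined-precision warnings that otherwise appear at extreme thresholds where no positive predictions are made.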
Observations¶
$$ \text{Precision} = \frac{TP}{TP + FP} $$

$$ \text{Recall} = \frac{TP}{TP + FN} $$

$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

- The threshold that maximizes F1 is 0.59, where recall is 0.89 and precision is 0.97
- From the plot, notice that all three lines converge at a threshold of around 0.3. By the formulas above, precision and recall are equal exactly when the number of false positives equals the number of false negatives, so at this threshold FP ≈ FN; F1, as the harmonic mean of two equal values, coincides with them
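The convergence argument can be checked numerically: with FP = FN, precision = TP/(TP+FP) = TP/(TP+FN) = recall, and F1 takes the same value. A quick check with toy counts (illustrative numbers only, not the actual confusion-matrix counts):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Equal FP and FN -> all three metrics coincide
p, r, f1 = prf1(tp=100, fp=8, fn=8)
print(p, r, f1)  # all equal 100/108 ≈ 0.9259
```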
Extension of this project¶
- We will investigate the actual number of FP and TP at this threshold later in the notebook
Model Diagnostics¶
Feature Importances¶
In [48]:
explainer = shap.Explainer(best_gbt)
shap_values = explainer(X_test[numerical_features].to_pandas())
shap.plots.beeswarm(shap_values)
Observations¶
- Having a low number of connections is a strong indicator of anomalous behavior
False Positives & False Negatives¶
In [52]:
thr = 0.3
test_binary_pred = test_preds[:, 1] >= thr
fps = np.where(~np.array(y_test) & test_binary_pred)[0]
fns = np.where(np.array(y_test) & ~test_binary_pred)[0]
print(f"There are {len(fps)} false positives")
print(f"There are {len(fns)} false negatives")
There are 8 false positives
There are 8 false negatives
Observations¶
- As predicted from the plot, at a threshold of 0.3 there are equal numbers of false positives and false negatives (8 of each)
In [55]:
shap.plots.waterfall(shap_values[fps[0]])
In [56]:
shap.plots.waterfall(shap_values[fns[0]])
Observations¶
- The examples of false positives and false negatives above show that number of connections has a strong influence on the classification of an example
Dump Model for Deployment¶
In [59]:
joblib.dump(best_gbt, '../model/best_hgbt.joblib')
Out[59]:
['../model/best_hgbt.joblib']
Summary¶
- The anomaly detection task is solved fairly easily by training a supervised ML model
- To make the task more challenging, only the engineered features were used for anomaly detection
- The best model is a tuned HistGradientBoostingClassifier; the F1-maximizing decision threshold is 0.59
- The maximum F1 achievable with the trained model is 0.926, with a recall of 0.89 and a precision of 0.97; at a threshold of about 0.3, precision, recall, and F1 converge
- The trained model outperforms the heuristic baseline by roughly 0.24 in F1 score (0.93 vs 0.69)
- SHAP values suggest that sessions with fewer connections tend to be more anomalous